Skip to content

fix(node): close under-withholding via full ref scope and reachable-only pin set#46

Open
beardthelion wants to merge 65 commits into
Gitlawb:mainfrom
beardthelion:fix/withholding-ref-scope
Open

fix(node): close under-withholding via full ref scope and reachable-only pin set#46
beardthelion wants to merge 65 commits into
Gitlawb:mainfrom
beardthelion:fix/withholding-ref-scope

Conversation

@beardthelion

@beardthelion beardthelion commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Closes #42.

blob_paths in visibility_pack.rs builds the blob→path map that every withholding decision rests on. Two flaws let it produce an incomplete withheld set, which downstream code then treats as public, leaking withheld content in cleartext:

  1. Ref-namespace gap. It walked only refs/heads/* and refs/tags/*, but the serve path enumerates git rev-list --objects --all (every object reachable from any ref). A blob reachable only through another namespace (e.g. refs/notes/*) never entered the withheld set and shipped in cleartext. Fix: drop the heads/tags filter and walk every ref, still per-ref via git ls-tree -r so all paths a blob lives at are seen (a blob is withheld only if denied at every path; if it is public at some allowed path it must still be sent). Note this is deliberately not rev-list --objects, which reports one representative path per object and would regress that correctness.

  2. Fail-open ref walk. A non-zero git ls-tree exit was silently skipped, yielding a partial set. Fix: fail closed (bail!). All three callers already abort on error (serve propagates, push/encrypt-pin skip the block), so no path serves unfiltered.

It also narrows the two pin enumerators (ipfs_pin, pinata) from git cat-file --batch-all-objects to git rev-list --objects --all. The old call included unreachable/dangling objects, which have no ref and no tree path, so visibility rules cannot classify them; a dangling copy of a secret blob would bypass withholding straight into a permanent IPFS pin. Reachable-only keeps the pin set aligned with what blob_paths can evaluate.

Four tests added: a blob reachable only via a non-head/tag ref is still withheld; an untraversable ref makes the withheld computation return Err; and a dangling blob is absent from each pin enumerator while a committed blob is present.

Stacks on #40 (base=main), carrying the cumulative Phase-2/3 prereq diff since these files exist only in the unmerged stack.

Summary by CodeRabbit

  • New Features
    • Added encrypted storage and retrieval for withheld blobs
    • New gl clone command supporting partial clones with content filtering and automatic blob recovery
    • Added API endpoints for querying encrypted blobs and withheld paths
    • Enhanced sync functionality to replicate encrypted content across mirrors

…al clone

upload_pack_excluding emitted a v2 packfile section, but info_refs
advertises v0, so real clients negotiated v0 and rejected the response
with 'expected ACK/NAK, got packfile'. Frame the v0 stateless-rpc shape
instead (NAK, then the pack via side-band-64k when offered).

Add an end-to-end test that stands up info_refs + upload_pack_excluding
and runs a real git partial clone, asserting the withheld blob's bytes
never reach the client while its tree entry and SHA stay visible. A stock
full clone cannot consume the pack (it is not closed under reachability,
so fetch fails the connectivity check); a partial clone is required.
…tion choice

Add a real-git test that partial-clones, pushes a new commit server-side,
then fetches: the new object arrives and the withheld blob stays absent.
This pins down that ignoring have/want negotiation (always sending a
self-contained pack of all refs minus withheld, with NAK) is correct for
both clone and fetch; the only cost is a fetch re-sends the full object
set. Refactor the real-git tests onto a shared server harness and document
the negotiation decision in code and in the plan's follow-ups.
Move the two blocking git shell-outs in the filtered upload-pack path off
the async worker thread, matching the tokio::process / spawn_blocking usage
already in this file: build_filtered_pack (rev-list + pack-objects) and
withheld_blob_oids (per-ref ls-tree) now run inside spawn_blocking so a large
repo cannot stall the tokio runtime. Behavior is unchanged.

Also fix the Task 0 findings block in the Phase 3 plan: it still recorded v2
packfile framing, which is the exact path that failed against a real client
and was corrected to v0. The block now documents the shipped v0 contract.
Drop a stray trailing code fence flagged by markdownlint (MD040).

The speculative ls-tree timeout and the public/no-rules fast-path from the
review are intentionally left out: the timeout guards against adversarial
repos we do not yet host, and the fast-path is a micro-optimization not worth
the extra branch right now.
kevincodex1 asked to keep the superpowers planning docs out of the repo. The
Phase 3 plan was scaffolding for this change, not something the project needs
to carry. Removing it leaves only the code and tests in the PR.
Three fixes from the PR Gitlawb#33 review:

- withheld_paths now applies the whole-repo "/" read gate (returns
  repo-not-found when the caller cannot read the root), matching the git read
  endpoints. Without it the endpoint disclosed a private repo's existence and
  path layout to unauthorized callers. The withheld_globs doc already assumed
  this gate existed; now it does.

- A nested allow under a denied parent (e.g. "/secret/public/**" allowed,
  "/secret/**" denied) was over-withheld: the client sparse-excluded the whole
  parent and hid paths the caller may read. The endpoint now also returns a
  "reinclude" list (allowed globs strictly under a denied one) and gl clone
  re-includes them in the sparse spec after the excludes.

- Wildcard-free globs like "/docs/private" match both the exact path and a
  subtree (per glob_matches), but the client only emitted the subtree exclude.
  sparse_patterns now emits both "/docs/private" and "/docs/private/".

Verified the exclude-then-reinclude sparse ordering checks out cleanly with
real git, plus unit tests for reincluded_globs, the nested re-include, the
exact-path exclude, and sparse_patterns.
…eld-path errors

split_once('/') accepted owner/name/extra, smuggling a path segment into
the repo name that then flowed into the API path and clone URL; reject it.

fetch_withheld swallowed every network/auth/5xx/JSON error into an empty
result, dropping to a stock clone that the node refuses once blobs are
withheld. Now only 404/501 (endpoint unsupported) fall back to empty; the
rest propagate so the real cause surfaces.
The receive-pack replication chokepoint called withheld_blob_oids
directly on the tokio worker, where its blocking git ls-tree walk can
stall the runtime for repos with many refs. Wrap it in spawn_blocking
to match the upload-pack serve path.
Add a regression test for deny /secret, allow /secret/public, deny
/secret/public/admin and clarify the sparse-checkout comment. git does
not re-traverse an explicitly excluded directory, so emitting all
excludes before re-includes keeps the deepest deny in force; the broader
parent re-include does not resurrect it.
…hema

Read the default branch from the local origin/HEAD symref clone already
set, instead of parsing the localized, network-dependent output of
git remote show origin. Deserialize the withheld-paths body into a typed
struct so a missing or mistyped withheld/reinclude field is a hard error
rather than silently becoming an empty list, which would mask a server
regression behind a later clone failure.
beardthelion and others added 21 commits June 10, 2026 18:52
…or present-check

Add two hermetic integration tests for recover_from_arweave that drive the
full read path over mocked Arweave GraphQL + IPFS gateways: discover the
manifest, fetch it, fetch the envelope, decrypt, and install the withheld
blob. One covers an authorized recipient (blob installed), the other a
non-recipient (nothing recovered). Both simulate origin death by removing
the promisor remote and enable uploadpack.allowFilter so the blob is truly
withheld over file://.

Also harden the local presence check in recover_from_arweave with
GIT_NO_LAZY_FETCH=1 and .output() so the expected 'missing object' case does
not trigger a wasted promisor fetch or leak git stderr to the user.
db: parse_recipients now propagates a descriptive error instead of
defaulting corrupt recipients JSON to an empty list, which would have
denied authorized readers and handed peers incomplete metadata.

gl: clone recovery now warns when the sparse-checkout file cannot be read
or written, or when the post-recovery checkout fails, instead of silently
discarding those errors and claiming files were recovered.
Envelope v2 drops the cleartext recipient public key (kid) from each
wrapped-key header entry, so a party holding the public IPFS/Arweave copy
can no longer enumerate who is authorized to decrypt a withheld blob. The
version is bumped to 2 and v1 envelopes are rejected. open_blob now selects
the reader's entry by trial decryption (the AEAD tag authenticates exactly
one entry) instead of matching on the public key.

Recipient count is still visible; identity is not. Scope is envelope-only,
the node DB recipients column and peer replication metadata are unchanged.

Co-authored-by: CommandCodeBot <noreply@commandcode.ai>
open_blob fed attacker-controlled envelopes to GenericArray::from_slice and
XNonce::from_slice, which panic on a wrong-length input. Validate both the
per-recipient box nonce and the body nonce to 24 bytes before use: skip a
recipient entry whose nonce is malformed, and return an error for a malformed
body nonce, so the public recovery path surfaces an error rather than panicking.
The encrypted-blobs/replicate endpoint shipped every blob's full recipient
DID list to mirrors, which persisted it, so any mirroring peer learned the
reader set. The v2 envelope blinding already removed recipient public keys
from the pinned envelopes, so the comment justifying this (DIDs are already
public) was no longer true.

/replicate now returns {oid, cid} only. Mirrors detect a re-seal by the CID
changing (the OID is stable across re-seals) instead of comparing recipient
sets, and store no recipient identities. Origin-side authz and the at-rest
recipients column are unchanged; this blinds only the peer-facing surface.
The B3 encrypted-blob manifest anchored {oid, cid, recipients} to Arweave, a
permanent public record of every blob's reader set. The v2 envelope blinding
already removed recipient keys from the pinned envelopes, so the comment
justifying this (recipient DIDs are already public) was no longer true. The
manifest now anchors {oid, cid} only. The gl reader already ignores the
recipients field (it recovers by trial decryption), so no reader changes.
…adability

The origin no longer stores recipient DIDs. Migration v5 replaces the
encrypted_blobs.recipients column with an opaque, node-keyed recipients_tag
used only to detect a recipient-set change for re-seal. B1 discovery and fetch
are now gated by the same repo-readability check the git read path uses, not by
per-recipient matching; decryption is gated by the envelope crypto, so a
non-recipient who can read the repo sees a blob's {oid, cid} but cannot open it.
encrypt_and_pin keys the tag from the node seed and returns {oid, cid}; the
Arweave manifest tuple drops the now-unused recipient vec.

A DB compromise no longer reveals the reader set; recovering it would require
brute-forcing candidate DID sets against the keyed tag with the node key.
…rror

Address review on the at-rest blinding change:
- The encrypted-blobs/replicate listing returned {oid, cid} with no visibility
  check, so a non-readable repo's blob index was reachable by an unauthenticated
  caller who guessed {owner}/{repo}. Gate it by the same repo-readability check
  discovery and fetch use. For the intended case (a public repo with withheld
  subtrees) the public root keeps this open to peers; only fully non-readable
  repos are withheld, which is the desired behavior.
- encrypt_and_pin treated a recipients_tag DB read error as a cache miss and
  resealed, causing avoidable IPFS writes during a partial outage; skip and retry
  on the next push instead.
- Correct the get_encrypted_blob doc comment to describe repo-readability access.
…nly pin set

blob_paths walked only refs/heads/* and refs/tags/* and skipped silently on a
failed git ls-tree, so a blob reachable only through another namespace, or a ref
that failed to traverse, could fall out of the withheld set and ship in cleartext.
Walk every ref and fail closed on traversal error.

The pin enumerators (ipfs_pin, pinata) used git cat-file --batch-all-objects,
which includes unreachable/dangling objects that have no path and cannot be
classified for withholding. Switch them to git rev-list --objects --all so the
pin set matches the reachable graph blob_paths evaluates.
@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR implements end-to-end encryption and withholding for Phase 3 confidentiality: withheld blobs are sealed to recipient keys via ephemeral X25519 envelopes, encrypted with XChaCha20Poly1305, pinned to IPFS with opaque recipient tags, anchored to Arweave, replicated across mirrors using promisor-based filtering, and recovered locally via a new gl clone command with node and Arweave fallback paths.

Changes

Core encryption and visibility foundations

Layer / File(s) Summary
Envelope encryption with X25519 key agreement
crates/gitlawb-core/src/encrypt.rs, crates/gitlawb-core/Cargo.toml
New seal_blob/open_blob module implements multi-recipient envelope encryption: seal_blob generates random content key, encrypts plaintext with XChaCha20Poly1305, wraps content key per recipient via ephemeral X25519 and crypto_box, and outputs magic+version+header length+blinded JSON header+encrypted body. open_blob validates envelope structure, derives X25519 secret from keypair seed, attempts AEAD decryption per recipient entry, and decrypts body. Tests cover key agreement, round-trips, tampering, header blinding, and nonce-length protection. Dependencies: curve25519-dalek, crypto_box, chacha20poly1305.
Keypair seed exposure and module wiring
crates/gitlawb-core/src/identity.rs, crates/gitlawb-core/src/lib.rs, crates/gitlawb-node/Cargo.toml
Keypair::seed_bytes() returns raw 32-byte Ed25519 seed for X25519 secret derivation. Encrypt module exported in crate root. Node workspace adds ed25519-dalek dependency.
Visibility-aware blob OID and recipient computation
crates/gitlawb-node/src/git/visibility_pack.rs
New module enumerates reachable blobs via git ls-tree -r per ref, computes withheld OID sets by evaluating visibility rules at each blob path (fail-closed on ref-walk errors), filters object lists for replication, and derives per-blob recipient DIDs as owner_did + reader DIDs allowed at any path. Comprehensive tests cover anonymous/owner/reader contexts, recipient computation, seal/open round-trips, and nonstandard ref namespaces.
Sparse checkout glob helpers and announce gate logic
crates/gitlawb-node/src/visibility.rs
New withheld_globs and reincluded_globs functions compute glob patterns for sparse-checkout exclusion/re-inclusion. Tests validate announce-gate logic (anonymous read of "/" gates replication announcement).

Server-side filtered pack serving and encrypted blob management

Layer / File(s) Summary
Filtered pack generation and git protocol v0 framing
crates/gitlawb-node/src/git/smart_http.rs
New build_filtered_pack shells out to rev-list/pack-objects to exclude withheld OIDs. upload_pack_excluding serves filtered packs in blocking task with protocol v0 NAK framing, optional side-band-64k band-1 wrapping, and comprehensive unit/integration tests using local Axum server, real git clone --filter=blob:none, and incremental fetch scenarios.
Upload-pack visibility enforcement in repos.rs
crates/gitlawb-node/src/api/repos.rs (partial)
git_upload_pack computes withheld blob OIDs via spawn_blocking, routes to normal or filtered pack handler, and remaps errors (protocol issues → BadRequest, git failures → Git).
Encrypt-and-pin with recipient derivation and DB recording
crates/gitlawb-node/src/encrypted_pin.rs, crates/gitlawb-node/src/main.rs
New module computes opaque HMAC-SHA256 recipients_tag (node-keyed, order-insensitive); skips re-sealing when tag matches DB record; otherwise resolves recipient DIDs to keys, reads blob bytes, seals via core encrypt module, pins envelope to IPFS, and records (oid, cid, tag). Best-effort per-blob error handling with logging. Tests validate tag properties and keying.
IPFS and Pinata object filtering and CID retrieval
crates/gitlawb-node/src/ipfs_pin.rs, crates/gitlawb-node/src/pinata.rs
Both modules now accept withheld: &HashSet<String> parameter, filter enumerated objects through replicable_objects, and update list_all_objects to use git rev-list --objects --all instead of batch-all-objects (only reachable objects). IPFS adds new cat function to fetch CID bytes. Tests assert unreachable blobs excluded.
Per-push encrypted manifest anchoring to Arweave
crates/gitlawb-node/src/arweave.rs
New EncryptedManifest struct and anchor_encrypted_manifest function build JSON payload with repo/owner/node/timestamp metadata and blob entries (oid, cid only; no recipients), send to Irys endpoint, return transaction id. No-ops when irys_url empty or blob list empty. Tests mock successful anchoring and validate missing recipient fields.
Receive-pack Phase 2 replication enforcement and propagation gating
crates/gitlawb-node/src/api/repos.rs (main change)
New replication logic in git_receive_pack determines announce eligibility and optional withheld set from visibility rules. When announcable, conditionally spawns IPFS pinning of new objects, calls encrypt_and_pin, and attempts manifest anchoring (all best-effort). Downstream gossipsub and HTTP peer sync notifications are suppressed unless announce true. Arweave permanent anchoring gated by announce && !irys_url.is_empty().
Database migrations and encrypted blob query/write methods
crates/gitlawb-node/src/db/mod.rs
Schema v4 adds encrypted_blobs(repo_id, oid) with cid and recipients column plus index. Schema v5 drops recipients column, adds recipients_tag with non-null default. Db impl adds record_encrypted_blob upsert, list_all_encrypted_blobs, encrypted_blob_cid, and encrypted_blob_recipients_tag read methods.
REST API endpoints for encrypted blob operations
crates/gitlawb-node/src/api/encrypted.rs, crates/gitlawb-node/src/api/mod.rs, crates/gitlawb-node/src/api/visibility.rs
New handlers: list_encrypted_blobs returns {oid, cid} pairs visible to caller (visibility-gated); get_encrypted_blob fetches encrypted envelope bytes from IPFS; replicate_encrypted_blobs returns replication-format JSON (no recipients); withheld_paths returns withheld and reinclude globs (whole-repo visibility gate, no reader_dids). Unit tests validate JSON schema.
HTTP router registration for encrypted blob and visibility endpoints
crates/gitlawb-node/src/server.rs
New GET routes under /api/v1/repos/{owner}/{repo}/ for withheld-paths, encrypted-blobs list/get/replicate wired to visibility and encrypted handlers.

Mirror mode classification and encrypted blob replication

Layer / File(s) Summary
Mirror mode selection and withheld-paths fetching
crates/gitlawb-node/src/sync.rs (Part 1)
Introduces MirrorMode enum (Plain vs Promisor), fetch_withheld HTTP query to origin /withheld-paths (returns None on non-success/network/parse), and classify_mirror to select strategy. Also adds ReplicaBlob/ReplicateResponse types and blobs_needing_replication to decide which envelopes need (re)replication. Reusable reqwest::Client with 30s timeout passed through batch processing.
Encrypted blob replication for mirrors (Option B2)
crates/gitlawb-node/src/sync.rs (Part 2)
New replicate_encrypted_blobs function fetches blob listings from origin, compares against local DB state by OID, pulls missing/changed envelopes via IPFS cat, validates CID match, and records rows keyed by mirror repo_id. Best-effort per-blob with logging; no-ops when IPFS unconfigured, endpoint absent, or no blobs returned.
Promisor clone/fetch with mode transitions
crates/gitlawb-node/src/sync.rs (Part 3)
Updates clone_repo and fetch_repo to accept MirrorMode, apply --filter=blob:limit=10g and promisor/partialclonefilter config for promisor mode, and detect/unset promisor settings on transition back to plain mode (refetch with --refetch --prune). Helpers: git_run (stderr on error), git_run_lenient (ignore failures), git_config_get for promisor detection. Tests cover mode classification, replication decision, and promisor↔plain transitions.
Sync worker loop integration and client construction
crates/gitlawb-node/src/sync.rs (Part 4)
Per-item sync calls fetch_withheld, classifies mirror mode, routes to clone/fetch variants, then calls replicate_encrypted_blobs. Signature changes to process_batch accept client parameter.

Client-side clone command with recovery paths

Layer / File(s) Summary
CLI parsing, partial clone, and sparse checkout configuration
crates/gl/src/clone.rs, crates/gl/src/main.rs
New CloneArgs struct. setup_partial_clone clones with --filter=blob:none --no-checkout, writes non-cone sparse-checkout excluding withheld globs and re-including nested allowed paths, performs branch checkout (defaults to origin/HEAD). parse_repo normalizes gitlawb:/// and bare / formats (supports DID owners with colons). Main adds Clone subcommand routing.
Node-based encrypted blob recovery via HTTP
crates/gl/src/clone.rs (Part 1)
fetch_withheld queries node /withheld-paths, selecting signed vs unsigned endpoint based on local key, returning empty globs on 404/501. recover_encrypted_blobs fetches /encrypted-blobs list and per-blob envelopes, decrypts via open_blob, installs via hash-object -w, verifies OID match, records repo-relative paths. Best-effort per-blob error handling.
Arweave/IPFS fallback recovery with manifest discovery
crates/gl/src/clone.rs (Part 2)
Types: ManifestBlob, Manifest for deserialization. parse_tx_ids extracts Arweave GraphQL transaction ids. recover_from_arweave discovers encrypted manifest transactions via GraphQL, fetches/parses each manifest, merges by latest RFC3339 timestamp, then for each missing object (guarded by GIT_NO_LAZY_FETCH) downloads envelope from IPFS CID endpoint, decrypts with caller key, installs/verifies blob OIDs. Best-effort per-blob; skips on gateway/parse/timeout failures.
Clone flow orchestration and sparse-checkout materialization
crates/gl/src/clone.rs (Part 3)
run() function validates destination non-existence, fetches withheld/reinclude globs, performs partial clone setup, conditionally runs node-based recovery then Arweave fallback (when keypair available), updates sparse-checkout to re-materialize recovered paths for sparse clones, attempts checkout, prints messages. Comprehensive tests cover sparse patterns, response parsing, manifest merging, and end-to-end Arweave/IPFS scenarios with mocked gateways.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

  • Gitlawb/node#42: Raised Phase 3 withholding issues with incomplete withheld sets from ref-namespace scope (heads/tags only) and fail-open ref traversal that could leak cleartext blobs; this PR's visibility_pack.rs implementation enumerates blobs only from heads/tags and skips ref traversal errors silently, directly matching the issue's failure modes.
  • Gitlawb/node#41: Calls for hardening Keypair::seed_bytes() to return a zeroizing wrapper; this PR exposes the method for X25519 derivation but does not yet zeroize the returned bytes, leaving the seed in memory.

Possibly related PRs

  • Gitlawb/node#25: Introduced path-scoped visibility rules and visibility_check logic; this PR extends that work by adding withheld_globs/reincluded_globs helpers and a new /withheld-paths endpoint that gate encrypted blob withholding and replication.

Suggested reviewers

  • kevincodex1

Poem

🐰 Blobs wrapped in X25519 envelopes bright,
Withheld by the rules, hidden from sight,
Pinned secret to IPFS, sealed tight,
Arweave anchors guide the recovery flight,
Clone sparse and recover—all's right! 🔐

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and specifically describes the main fix: closing under-withholding by improving ref scope coverage and using reachable-only pin sets.
Linked Issues check ✅ Passed All coding requirements from issue #42 are met: full-ref enumeration replaces heads/tags-only walk in visibility_pack.rs, ref-walk failures now fail closed, and pin enumerators are aligned with reachable graph via git rev-list.
Out of Scope Changes check ✅ Passed All changes directly support fixing the incomplete withheld set vulnerability. Encryption/decryption support and new APIs are necessary infrastructure for the Phase 2/3 replication and withholding system being deployed.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (4)
crates/gl/src/clone.rs (1)

262-267: ⚡ Quick win

Parse /encrypted-blobs into a typed response instead of Value.

A missing or mistyped blobs field currently degrades to [] and silently disables node-side recovery. Making this schema-strict would catch server/client contract drift the same way WithheldPathsResponse already does.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gl/src/clone.rs` around lines 262 - 267, Replace the loose
serde_json::Value parsing for resp in clone.rs with a strongly typed
EncryptedBlobsResponse struct so schema mismatches surface as errors: define
EncryptedBlobsResponse { blobs: Vec<EncryptedBlobOrString> } (matching the
server shape, analogous to WithheldPathsResponse), then call
resp.json::<EncryptedBlobsResponse>().await.context("parsing encrypted-blobs")?
and use the returned .blobs directly instead of
body.get(...).and_then(...).cloned().unwrap_or_default(); this ensures missing
or mistyped "blobs" no longer silently becomes [] and will fail fast (update the
EncryptedBlobOrString type to the actual blob element type used).
crates/gitlawb-node/src/api/encrypted.rs (1)

109-123: ⚡ Quick win

Add a handler test for the 404/404 access-control contract.

The new test only locks the wire shape. A small handler-level test that exercises both “caller denied” and “OID missing” on get_encrypted_blob would protect the deliberate non-oracle behavior from future refactors.

Based on learnings, this endpoint should keep “not found” and “not authorized” externally indistinguishable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gitlawb-node/src/api/encrypted.rs` around lines 109 - 123, Add a
handler-level test for the access-control contract of get_encrypted_blob that
verifies the external response is indistinguishable when the caller is denied
versus when the OID is missing: write two tests (or one parameterized test)
invoking the get_encrypted_blob handler — one scenario where the auth check
fails (simulate "caller denied") and one where the blob store returns NotFound
(simulate "OID missing") — and assert both responses have the same HTTP status
and body (e.g., both return 404/not-found) to ensure the endpoint does not leak
authorization state; reference the get_encrypted_blob handler in your test and
reuse existing test scaffolding in the encrypted.rs tests module.

Source: Learnings

crates/gitlawb-core/src/encrypt.rs (1)

66-88: ⚡ Quick win

Zeroize transient key material in seal_blob and open_blob.

content_key, the seed-derived X25519 secret, and the unwrapped content key all live in plain stack/heap buffers until drop. In this confidentiality path that is avoidable exposure; keep them in Zeroizing (or explicitly call zeroize()) once the ciphers are initialized.

Hardening sketch
+use zeroize::Zeroizing;
+
-    let mut content_key = [0u8; 32];
-    OsRng.fill_bytes(&mut content_key);
+    let mut content_key = Zeroizing::new([0u8; 32]);
+    OsRng.fill_bytes(&mut *content_key);
     let body_cipher = XChaCha20Poly1305::new_from_slice(&content_key)
         .map_err(|e| anyhow::anyhow!("content key: {e}"))?;
...
-    let my_x = XSecret::from(x25519_secret_from_seed(&keypair.seed_bytes()));
+    let seed = keypair.to_seed();
+    let my_x = XSecret::from(x25519_secret_from_seed(&seed));
...
-    let content_key = content_key.context("not a recipient of this envelope")?;
+    let content_key = Zeroizing::new(content_key.context("not a recipient of this envelope")?);

Also applies to: 126-167

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gitlawb-core/src/encrypt.rs` around lines 66 - 88, In seal_blob and
open_blob, sensitive key material (the 32-byte content_key, the ephemeral
XSecret `eph`, the seed-derived X25519 secret, and any unwrapped content key
buffers) must be zeroed after use: wrap stack/heap buffers in zeroize::Zeroizing
(e.g., Zeroizing<[u8;32]>, Zeroizing<Vec<u8>>) or explicitly call zeroize() on
types that implement Zeroize immediately after the ciphers are initialized and
after encryption/decryption completes; ensure you zero `content_key`, the
ephemeral secret `eph` (or call its zeroize method), the seed-derived X25519
secret produced by x25519_private/related variable, and the unwrapped content
key before returning or pushing into Recipient so no plaintext key remains in
memory.
crates/gitlawb-node/src/git/visibility_pack.rs (1)

101-130: 🏗️ Heavy lift

Avoid re-traversing the repo to derive recipients.

withheld_blob_recipients() does one full blob_paths() sweep through withheld_blob_oids() and then a second full blob_paths() sweep for the recipient pass. On the push path, git_receive_pack() has already computed the withheld set before it gets here, so repos with visibility rules now pay up to three complete ls-tree traversals per push. Threading the precomputed withheld/blob-path data into this API would keep recipient derivation single-pass.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gitlawb-node/src/git/visibility_pack.rs` around lines 101 - 130,
withheld_blob_recipients currently re-traverses the repo by calling
withheld_blob_oids(...) and then blob_paths(...), causing multiple ls-tree
sweeps; change withheld_blob_recipients signature to accept precomputed withheld
data (e.g., Option<&HashSet<String>> or Option<&HashMap<String,String>>
representing withheld oids or oid->path map) so callers like git_receive_pack
can pass the already-computed set/map, skip the internal withheld_blob_oids call
when provided, and iterate the provided oid->path mapping (or filter blob_paths
once) to build recipients; update callers (notably git_receive_pack) to supply
the precomputed data and keep behavior identical when None is passed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/gitlawb-core/src/identity.rs`:
- Around line 55-59: The new seed_bytes() method leaks a non-zeroizing copy of
the Ed25519 seed; replace its usage with the existing to_seed() which returns a
Zeroizing<[u8;32]>, or change seed_bytes() to return a zeroizing wrapper (e.g.
Zeroizing<[u8;32]>) instead of plain [u8;32]; update callers such as the code
paths that call seed_bytes() (including open_blob and the encrypted-pin task in
crates/gitlawb-node/src/api/repos.rs) to accept and use the Zeroizing seed or to
call to_seed() directly so the raw seed is never held in plain memory.

In `@crates/gitlawb-node/src/api/repos.rs`:
- Around line 398-425: Short-circuit the expensive withheld walk by detecting
when the repo's rules cannot withhold any path-scoped blobs (e.g., rules is
empty or only applies to root) before calling
visibility_pack::withheld_blob_oids; if that check shows no possible withheld
blobs, skip the tokio::task::spawn_blocking(...) and call
smart_http::upload_pack(&disk_path, body).await directly, otherwise proceed to
spawn_blocking and then choose between smart_http::upload_pack and
smart_http::upload_pack_excluding based on the resulting withheld set; use the
existing variables and functions (rules, visibility_pack::withheld_blob_oids,
withheld, smart_http::upload_pack, smart_http::upload_pack_excluding) to
implement the early return.

In `@crates/gitlawb-node/src/encrypted_pin.rs`:
- Around line 77-80: Replace silent `continue` branches when
`crate::git::store::read_object(repo_path, oid)` and the IPFS pin operation fail
with warning logs that include contextual details (repo_path, oid, and the
error/result) so failures are observable; in the match arms around `let data =
match crate::git::store::read_object(repo_path, oid) { ... }` and the subsequent
IPFS pin match, call the module’s logger (e.g., process_logger.warn or
log::warn) with a concise message and the error/result, then continue as before
to preserve behavior.
- Around line 72-76: The current code builds `keys: Vec<VerifyingKey>` by
`filter_map(|d| did_to_key(d))`, which silently drops unresolved DIDs and
proceeds to encrypt to a partial recipient set; change this so any unresolved
DID causes the blob to be skipped: instead of `filter_map`, resolve all DIDs and
detect failures (e.g., collect Options or compare `keys.len()` to `dids.len()`
after mapping via `did_to_key`), and when a mismatch is found log a warning
referencing `oid` (similar to the existing log) and `continue` so you do not
seal to a partial set; ensure you keep the `VerifyingKey` handling consistent
with downstream code that expects `keys`.

In `@crates/gitlawb-node/src/ipfs_pin.rs`:
- Around line 75-85: The cat function creates a reqwest::Client with no timeout
(reqwest::Client::new()) which can hang get_encrypted_blob and the sync worker;
modify cat (or callers) to use a client with a bounded timeout (e.g., build a
Client via reqwest::Client::builder().timeout(Duration::from_secs(...)) and
reuse it) or apply a per-request timeout (e.g., wrap the send/bytes await with
tokio::time::timeout) so the POST to {}/api/v0/cat?arg={} fails fast on
slow/stalled Kubo; update the cat signature or module to accept/reuse a Client
or enforce a timeout around resp = client.post(&url).send().await and
resp.bytes().await to avoid indefinite hangs.

In `@crates/gitlawb-node/src/sync.rs`:
- Around line 181-188: The code currently treats a None result from
fetch_withheld as MirrorMode::Plain which incorrectly downgrades unknown/error
states; instead, check the Option returned by fetch_withheld and only set
MirrorMode::Plain when you get Some(empty); if fetch_withheld returns None
preserve the repository's existing promisor/partialclone state (do not call
classify_mirror on None). Concretely, change the branch around
fetch_withheld/classify_mirror to: if withheld.is_some() { let mode =
classify_mirror(withheld); } else { read the current repo promisor/partialclone
setting from the local repo at local_path (or otherwise keep the current
MirrorMode) and use that mode for fetch_repo/clone_repo }, so
fetch_repo/clone_repo are invoked with the preserved mode rather than
unconditionally using MirrorMode::Plain.

In `@crates/gl/src/clone.rs`:
- Around line 428-459: The current GraphQL fetch (the local `query` +
`gql_body`, use of `client.post(...).json(&gql_body).send().await`, and
subsequent `parse_tx_ids(&gql)`) only requests first:100 and stops, causing
missing manifests; change the logic to paginate: rewrite the GraphQL query to
take variables for "first" and "after" (cursor), then loop POSTing the query
(building `gql_body` with `{ "repo": slug, "first": batch_size, "after": cursor
}`), call `parse_tx_ids` (or adjust it to return edges + pageInfo) to collect tx
ids across pages, update `cursor` from pageInfo.nextCursor/hasNextPage until no
more pages, then proceed to fetch manifests and call
`merge_manifests(manifests)` as before; ensure the loop uses the same
`client.post(...).json(&gql_body).send().await` error handling as existing code
and retains the existing behavior when responses fail.
- Around line 415-416: The slug normalization and Arweave discovery/parsing need
tightening: ensure owner short-form uses the same split logic as the node by
keeping owner_short = owner.split(':').next_back().unwrap_or(owner) and forming
slug = format!("{owner_short}/{name}"); update the GraphQL discovery for
encrypted-manifest anchors in discover/recovery code to paginate instead of
hard-coding first:100 (iterate queries until no more edges or timestamps
covered) so recover_encrypted_blobs sees all anchors; change
recover_encrypted_blobs to deserialize the /encrypted-blobs response into a
strict typed struct (like the existing withheld-paths type) and fail on
missing/wrong fields instead of using Value with unwrap_or_default(), ensuring
schema-strict parsing and surfaced errors.

---

Nitpick comments:
In `@crates/gitlawb-core/src/encrypt.rs`:
- Around line 66-88: In seal_blob and open_blob, sensitive key material (the
32-byte content_key, the ephemeral XSecret `eph`, the seed-derived X25519
secret, and any unwrapped content key buffers) must be zeroed after use: wrap
stack/heap buffers in zeroize::Zeroizing (e.g., Zeroizing<[u8;32]>,
Zeroizing<Vec<u8>>) or explicitly call zeroize() on types that implement Zeroize
immediately after the ciphers are initialized and after encryption/decryption
completes; ensure you zero `content_key`, the ephemeral secret `eph` (or call
its zeroize method), the seed-derived X25519 secret produced by
x25519_private/related variable, and the unwrapped content key before returning
or pushing into Recipient so no plaintext key remains in memory.

In `@crates/gitlawb-node/src/api/encrypted.rs`:
- Around line 109-123: Add a handler-level test for the access-control contract
of get_encrypted_blob that verifies the external response is indistinguishable
when the caller is denied versus when the OID is missing: write two tests (or
one parameterized test) invoking the get_encrypted_blob handler — one scenario
where the auth check fails (simulate "caller denied") and one where the blob
store returns NotFound (simulate "OID missing") — and assert both responses have
the same HTTP status and body (e.g., both return 404/not-found) to ensure the
endpoint does not leak authorization state; reference the get_encrypted_blob
handler in your test and reuse existing test scaffolding in the encrypted.rs
tests module.

In `@crates/gitlawb-node/src/git/visibility_pack.rs`:
- Around line 101-130: withheld_blob_recipients currently re-traverses the repo
by calling withheld_blob_oids(...) and then blob_paths(...), causing multiple
ls-tree sweeps; change withheld_blob_recipients signature to accept precomputed
withheld data (e.g., Option<&HashSet<String>> or Option<&HashMap<String,String>>
representing withheld oids or oid->path map) so callers like git_receive_pack
can pass the already-computed set/map, skip the internal withheld_blob_oids call
when provided, and iterate the provided oid->path mapping (or filter blob_paths
once) to build recipients; update callers (notably git_receive_pack) to supply
the precomputed data and keep behavior identical when None is passed.

In `@crates/gl/src/clone.rs`:
- Around line 262-267: Replace the loose serde_json::Value parsing for resp in
clone.rs with a strongly typed EncryptedBlobsResponse struct so schema
mismatches surface as errors: define EncryptedBlobsResponse { blobs:
Vec<EncryptedBlobOrString> } (matching the server shape, analogous to
WithheldPathsResponse), then call
resp.json::<EncryptedBlobsResponse>().await.context("parsing encrypted-blobs")?
and use the returned .blobs directly instead of
body.get(...).and_then(...).cloned().unwrap_or_default(); this ensures missing
or mistyped "blobs" no longer silently becomes [] and will fail fast (update the
EncryptedBlobOrString type to the actual blob element type used).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: fab81d81-dc64-46fe-b486-709131c8a108

📥 Commits

Reviewing files that changed from the base of the PR and between a2217bf and cc30427.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (23)
  • crates/gitlawb-core/Cargo.toml
  • crates/gitlawb-core/src/encrypt.rs
  • crates/gitlawb-core/src/identity.rs
  • crates/gitlawb-core/src/lib.rs
  • crates/gitlawb-node/Cargo.toml
  • crates/gitlawb-node/src/api/encrypted.rs
  • crates/gitlawb-node/src/api/mod.rs
  • crates/gitlawb-node/src/api/repos.rs
  • crates/gitlawb-node/src/api/visibility.rs
  • crates/gitlawb-node/src/arweave.rs
  • crates/gitlawb-node/src/db/mod.rs
  • crates/gitlawb-node/src/encrypted_pin.rs
  • crates/gitlawb-node/src/git/mod.rs
  • crates/gitlawb-node/src/git/smart_http.rs
  • crates/gitlawb-node/src/git/visibility_pack.rs
  • crates/gitlawb-node/src/ipfs_pin.rs
  • crates/gitlawb-node/src/main.rs
  • crates/gitlawb-node/src/pinata.rs
  • crates/gitlawb-node/src/server.rs
  • crates/gitlawb-node/src/sync.rs
  • crates/gitlawb-node/src/visibility.rs
  • crates/gl/src/clone.rs
  • crates/gl/src/main.rs

Comment on lines +55 to +59
/// The raw 32-byte Ed25519 seed. Used to derive the X25519 secret for
/// envelope decryption (see `crate::encrypt`).
pub fn seed_bytes(&self) -> [u8; 32] {
self.signing_key.to_bytes()
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid adding a non-zeroizing raw-seed accessor.

to_seed() already exposes this material behind Zeroizing; seed_bytes() reintroduces the same long-lived signing seed as a plain [u8; 32]. That copy now flows into open_blob and the async encrypted-pin path, which widens private-key lifetime without adding any new capability. Prefer reusing to_seed() here, or make this helper return a zeroizing wrapper instead of raw bytes. This already shows up in crates/gitlawb-node/src/api/repos.rs:659-707, where the seed is moved into the spawned encrypted-pin task.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gitlawb-core/src/identity.rs` around lines 55 - 59, The new
seed_bytes() method leaks a non-zeroizing copy of the Ed25519 seed; replace its
usage with the existing to_seed() which returns a Zeroizing<[u8;32]>, or change
seed_bytes() to return a zeroizing wrapper (e.g. Zeroizing<[u8;32]>) instead of
plain [u8;32]; update callers such as the code paths that call seed_bytes()
(including open_blob and the encrypted-pin task in
crates/gitlawb-node/src/api/repos.rs) to accept and use the Zeroizing seed or to
call to_seed() directly so the raw seed is never held in plain memory.

Comment on lines +398 to +425
// withheld_blob_oids walks every ref with blocking `git ls-tree`; keep that
// off the async worker thread.
let withheld = {
let path = disk_path.clone();
let rules = rules.clone();
let owner_did = record.owner_did.clone();
let caller_owned = caller.map(str::to_string);
let is_public = record.is_public;
tokio::task::spawn_blocking(move || {
visibility_pack::withheld_blob_oids(
&path,
&rules,
is_public,
&owner_did,
caller_owned.as_deref(),
)
})
.await
.map_err(|e| {
let msg = e.to_string();
if msg.contains("bad line length") || msg.contains("protocol error") {
tracing::warn!(repo = %name, err = %msg, "git-upload-pack: bad client request");
AppError::BadRequest(msg)
} else {
tracing::error!(repo = %name, err = %msg, "git-upload-pack failed");
AppError::Git(msg)
}
})?;
.map_err(|e| AppError::Git(e.to_string()))?
.map_err(|e| AppError::Git(e.to_string()))?
};

let resp = if withheld.is_empty() {
smart_http::upload_pack(&disk_path, body).await
} else {
tracing::info!(repo = %name, caller = ?caller, withheld = withheld.len(), "serving filtered pack");
smart_http::upload_pack_excluding(&disk_path, body, &withheld).await
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip the withheld walk when no path-scoped rule can withhold anything.

After the "/" gate above, repos whose rule set is empty or root-only cannot produce a non-empty withheld blob set, but this branch still runs the full withheld_blob_oids() ref walk before falling back to the ordinary upload_pack(). That adds an extra full repo scan to every clone/fetch on the common public path.

💡 Minimal short-circuit
-    let withheld = {
-        let path = disk_path.clone();
-        let rules = rules.clone();
-        let owner_did = record.owner_did.clone();
-        let caller_owned = caller.map(str::to_string);
-        let is_public = record.is_public;
-        tokio::task::spawn_blocking(move || {
-            visibility_pack::withheld_blob_oids(
-                &path,
-                &rules,
-                is_public,
-                &owner_did,
-                caller_owned.as_deref(),
-            )
-        })
-        .await
-        .map_err(|e| AppError::Git(e.to_string()))?
-        .map_err(|e| AppError::Git(e.to_string()))?
-    };
-
-    let resp = if withheld.is_empty() {
+    let has_path_scoped_rules = rules.iter().any(|r| r.path_glob != "/");
+    let resp = if !has_path_scoped_rules {
         smart_http::upload_pack(&disk_path, body).await
     } else {
-        tracing::info!(repo = %name, caller = ?caller, withheld = withheld.len(), "serving filtered pack");
-        smart_http::upload_pack_excluding(&disk_path, body, &withheld).await
+        let withheld = {
+            let path = disk_path.clone();
+            let rules = rules.clone();
+            let owner_did = record.owner_did.clone();
+            let caller_owned = caller.map(str::to_string);
+            let is_public = record.is_public;
+            tokio::task::spawn_blocking(move || {
+                visibility_pack::withheld_blob_oids(
+                    &path,
+                    &rules,
+                    is_public,
+                    &owner_did,
+                    caller_owned.as_deref(),
+                )
+            })
+            .await
+            .map_err(|e| AppError::Git(e.to_string()))?
+            .map_err(|e| AppError::Git(e.to_string()))?
+        };
+        if withheld.is_empty() {
+            smart_http::upload_pack(&disk_path, body).await
+        } else {
+            tracing::info!(repo = %name, caller = ?caller, withheld = withheld.len(), "serving filtered pack");
+            smart_http::upload_pack_excluding(&disk_path, body, &withheld).await
+        }
     }
     .map_err(|e| {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gitlawb-node/src/api/repos.rs` around lines 398 - 425, Short-circuit
the expensive withheld walk by detecting when the repo's rules cannot withhold
any path-scoped blobs (e.g., rules is empty or only applies to root) before
calling visibility_pack::withheld_blob_oids; if that check shows no possible
withheld blobs, skip the tokio::task::spawn_blocking(...) and call
smart_http::upload_pack(&disk_path, body).await directly, otherwise proceed to
spawn_blocking and then choose between smart_http::upload_pack and
smart_http::upload_pack_excluding based on the resulting withheld set; use the
existing variables and functions (rules, visibility_pack::withheld_blob_oids,
withheld, smart_http::upload_pack, smart_http::upload_pack_excluding) to
implement the early return.

Comment on lines +72 to +76
let keys: Vec<VerifyingKey> = dids.iter().filter_map(|d| did_to_key(d)).collect();
if keys.is_empty() {
tracing::warn!(oid = %oid, "no resolvable recipient keys; skipping encrypted pin");
continue;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not encrypt to a partial recipient set when some DIDs fail key resolution.

At Line 72, filter_map drops unresolved DIDs and still seals for the remaining keys. That can make authorized readers unrecoverable even though the blob is recorded as successfully encrypted/pinned. Treat any unresolved DID as a per-blob failure and skip instead of partially sealing.

💡 Suggested fix
-        let keys: Vec<VerifyingKey> = dids.iter().filter_map(|d| did_to_key(d)).collect();
-        if keys.is_empty() {
-            tracing::warn!(oid = %oid, "no resolvable recipient keys; skipping encrypted pin");
+        let mut keys = Vec::with_capacity(dids.len());
+        let mut unresolved = Vec::new();
+        for did in dids {
+            match did_to_key(did) {
+                Some(k) => keys.push(k),
+                None => unresolved.push(did.clone()),
+            }
+        }
+        if !unresolved.is_empty() {
+            tracing::warn!(oid = %oid, unresolved = ?unresolved, "some recipient DIDs could not be resolved; skipping encrypted pin");
+            continue;
+        }
+        if keys.is_empty() {
+            tracing::warn!(oid = %oid, "empty recipient set; skipping encrypted pin");
             continue;
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gitlawb-node/src/encrypted_pin.rs` around lines 72 - 76, The current
code builds `keys: Vec<VerifyingKey>` by `filter_map(|d| did_to_key(d))`, which
silently drops unresolved DIDs and proceeds to encrypt to a partial recipient
set; change this so any unresolved DID causes the blob to be skipped: instead of
`filter_map`, resolve all DIDs and detect failures (e.g., collect Options or
compare `keys.len()` to `dids.len()` after mapping via `did_to_key`), and when a
mismatch is found log a warning referencing `oid` (similar to the existing log)
and `continue` so you do not seal to a partial set; ensure you keep the
`VerifyingKey` handling consistent with downstream code that expects `keys`.

Comment on lines +77 to +80
let data = match crate::git::store::read_object(repo_path, oid) {
Ok(Some((_t, bytes))) => bytes,
_ => continue,
};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add warnings for object-read and IPFS pin failures instead of silent skips.

At Line 77 and Line 88, failures currently continue without logs. This makes production retry/debugging hard and weakens observability during partial outages.

💡 Suggested fix
-        let data = match crate::git::store::read_object(repo_path, oid) {
-            Ok(Some((_t, bytes))) => bytes,
-            _ => continue,
-        };
+        let data = match crate::git::store::read_object(repo_path, oid) {
+            Ok(Some((_t, bytes))) => bytes,
+            Ok(None) => {
+                tracing::warn!(oid = %oid, "blob object not found locally; skipping encrypted pin");
+                continue;
+            }
+            Err(e) => {
+                tracing::warn!(oid = %oid, err = %e, "failed reading blob object; skipping encrypted pin");
+                continue;
+            }
+        };
@@
-        let cid = match crate::ipfs_pin::pin_git_object(ipfs_api, oid, &envelope).await {
-            Ok(c) if !c.is_empty() => c,
-            _ => continue,
-        };
+        let cid = match crate::ipfs_pin::pin_git_object(ipfs_api, oid, &envelope).await {
+            Ok(c) if !c.is_empty() => c,
+            Ok(_) => {
+                tracing::warn!(oid = %oid, "pin_git_object returned empty CID; skipping record");
+                continue;
+            }
+            Err(e) => {
+                tracing::warn!(oid = %oid, err = %e, "pin_git_object failed; skipping encrypted pin");
+                continue;
+            }
+        };

Also applies to: 88-90

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gitlawb-node/src/encrypted_pin.rs` around lines 77 - 80, Replace
silent `continue` branches when `crate::git::store::read_object(repo_path, oid)`
and the IPFS pin operation fail with warning logs that include contextual
details (repo_path, oid, and the error/result) so failures are observable; in
the match arms around `let data = match
crate::git::store::read_object(repo_path, oid) { ... }` and the subsequent IPFS
pin match, call the module’s logger (e.g., process_logger.warn or log::warn)
with a concise message and the error/result, then continue as before to preserve
behavior.

Comment on lines +75 to +85
/// Fetch raw bytes for a CID from the local Kubo node (`/api/v0/cat`).
pub async fn cat(ipfs_api: &str, cid: &str) -> Result<Vec<u8>> {
if ipfs_api.is_empty() {
return Err(anyhow::anyhow!("IPFS not configured"));
}
let url = format!("{}/api/v0/cat?arg={}", ipfs_api.trim_end_matches('/'), cid);
let resp = reqwest::Client::new().post(&url).send().await?;
if !resp.status().is_success() {
return Err(anyhow::anyhow!("ipfs cat {cid}: {}", resp.status()));
}
Ok(resp.bytes().await?.to_vec())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bound the Kubo /cat request.

This new helper creates a fresh reqwest::Client with no timeout, so a slow or stalled Kubo fetch can hang both get_encrypted_blob and the sync worker indefinitely. Please reuse a client with a bounded timeout, or set a per-request timeout here.

⏱️ Suggested fix
 pub async fn cat(ipfs_api: &str, cid: &str) -> Result<Vec<u8>> {
     if ipfs_api.is_empty() {
         return Err(anyhow::anyhow!("IPFS not configured"));
     }
     let url = format!("{}/api/v0/cat?arg={}", ipfs_api.trim_end_matches('/'), cid);
-    let resp = reqwest::Client::new().post(&url).send().await?;
+    let resp = reqwest::Client::new()
+        .post(&url)
+        .timeout(std::time::Duration::from_secs(30))
+        .send()
+        .await?;
     if !resp.status().is_success() {
         return Err(anyhow::anyhow!("ipfs cat {cid}: {}", resp.status()));
     }
     Ok(resp.bytes().await?.to_vec())
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gitlawb-node/src/ipfs_pin.rs` around lines 75 - 85, The cat function
creates a reqwest::Client with no timeout (reqwest::Client::new()) which can
hang get_encrypted_blob and the sync worker; modify cat (or callers) to use a
client with a bounded timeout (e.g., build a Client via
reqwest::Client::builder().timeout(Duration::from_secs(...)) and reuse it) or
apply a per-request timeout (e.g., wrap the send/bytes await with
tokio::time::timeout) so the POST to {}/api/v0/cat?arg={} fails fast on
slow/stalled Kubo; update the cat signature or module to accept/reuse a Client
or enforce a timeout around resp = client.post(&url).send().await and
resp.bytes().await to avoid indefinite hangs.

Comment on lines +181 to 188
let withheld = fetch_withheld(client, &origin_url, owner_short, repo_name).await;
let mode = classify_mirror(withheld);

let result = if local_path.exists() {
fetch_repo(&local_path, &remote_url).await
fetch_repo(&local_path, &remote_url, mode).await
} else {
clone_repo(&remote_url, &local_path).await
clone_repo(&remote_url, &local_path, mode).await
};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t downgrade an unknown mirror state to Plain.

fetch_withheld collapses 404, 5xx, network, and parse failures into None, and this branch always turns that into MirrorMode::Plain. For an existing promisor mirror, that flows into fetch_repo(..., MirrorMode::Plain), which clears remote.origin.promisor and remote.origin.partialclonefilter before refetching. A transient withheld-paths outage on a still-withheld repo will therefore start failing syncs until the endpoint recovers. Only downgrade on Some(empty), and preserve the current promisor mode when the lookup is unknown.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gitlawb-node/src/sync.rs` around lines 181 - 188, The code currently
treats a None result from fetch_withheld as MirrorMode::Plain which incorrectly
downgrades unknown/error states; instead, check the Option returned by
fetch_withheld and only set MirrorMode::Plain when you get Some(empty); if
fetch_withheld returns None preserve the repository's existing
promisor/partialclone state (do not call classify_mirror on None). Concretely,
change the branch around fetch_withheld/classify_mirror to: if
withheld.is_some() { let mode = classify_mirror(withheld); } else { read the
current repo promisor/partialclone setting from the local repo at local_path (or
otherwise keep the current MirrorMode) and use that mode for
fetch_repo/clone_repo }, so fetch_repo/clone_repo are invoked with the preserved
mode rather than unconditionally using MirrorMode::Plain.

Comment thread crates/gl/src/clone.rs
Comment on lines +415 to +416
let owner_short = owner.split(':').next_back().unwrap_or(owner);
let slug = format!("{owner_short}/{name}");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Client-side Arweave lookup normalization:"
rg -n -C2 'owner_short|slug = format!|graphql' crates/gl/src/clone.rs

echo
echo "Server-side encrypted-manifest tag construction:"
rg -n -C4 'encrypted-manifest|Schema|Repo|App-Name' crates/gitlawb-node/src/arweave.rs crates/gitlawb-node/src/api/repos.rs

Repository: Gitlawb/node

Length of output: 20700


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Find EncryptedManifest and where `manifest.repo` is set:"
rg -n "struct EncryptedManifest|manifest\.repo|EncryptedManifest<'_>" crates/gitlawb-node/src/arweave.rs

echo
echo "Search for `repo:` field construction / slug derivation in node code:"
rg -n "Repo\s*[:=]|repo\s*:\s*|format!\(.*owner.*name|owner.*split\(':'\)|next_back\(\)" crates/gitlawb-node/src/arweave.rs crates/gitlawb-node/src -S

Repository: Gitlawb/node

Length of output: 10904


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== crates/gitlawb-node/src/arweave.rs (top + EncryptedManifest + tag header) ==="
nl -ba crates/gitlawb-node/src/arweave.rs | sed -n '1,260p'

echo
echo "=== crates/gl/src/clone.rs (Arweave fallback slug + GraphQL variables) ==="
nl -ba crates/gl/src/clone.rs | sed -n '360,520p'

Repository: Gitlawb/node

Length of output: 199


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "arweave.rs lines:"
wc -l crates/gitlawb-node/src/arweave.rs
echo
echo "crates/gitlawb-node/src/arweave.rs: EncryptedManifest + manifest_json payload + build_manifest_tags_header + any construction sites:"
cat -n crates/gitlawb-node/src/arweave.rs | sed -n '1,260p'

echo
echo "clone.rs: Arweave fallback slug + GraphQL body:"
wc -l crates/gl/src/clone.rs
echo
cat -n crates/gl/src/clone.rs | sed -n '380,480p'

Repository: Gitlawb/node

Length of output: 15653


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Call sites for anchor_encrypted_manifest / EncryptedManifest construction ==="
rg -n "anchor_encrypted_manifest|EncryptedManifest\s*\{" crates/gitlawb-node/src

echo
echo "=== Inspect around likely repo_slug construction (owner_short + repo_name) in api/repos.rs ==="
rg -n "owner_short\s*=.*split\(':'\)\.next_back\(\)\.unwrap_or|repo_slug\s*=\s*format!" crates/gitlawb-node/src/api/repos.rs -n
sed -n '650,820p' crates/gitlawb-node/src/api/repos.rs | cat -n

Repository: Gitlawb/node

Length of output: 9929


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== clone.rs: encrypted-blobs parsing ==="
rg -n "/encrypted-blobs|encrypted-blobs|recover_encrypted_blobs|serde_json::Value" crates/gl/src/clone.rs
echo
echo "=== clone.rs: recover_encrypted_blobs implementation ==="
# Print the function body area (best-effort) around its definition
rg -n "async fn recover_encrypted_blobs" -n crates/gl/src/clone.rs | head -n 5
# Use line ranges around likely area
sed -n '1,220p' crates/gl/src/clone.rs | head -n 220

echo
echo "=== clone.rs: fetch_withheld + schema strictness for withheld-paths ==="
rg -n "withheld-paths|fetch_withheld|recover_from_arweave" crates/gl/src/clone.rs

Repository: Gitlawb/node

Length of output: 9912


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# encrypted-blobs + parsing behavior
echo "=== clone.rs: recover_encrypted_blobs + parse_tx_ids ==="
sed -n '220,420p' crates/gl/src/clone.rs

echo
echo "=== clone.rs: WithheldPathsResponse parsing (schema strictness / fail-open) ==="
sed -n '180,260p' crates/gl/src/clone.rs

echo
echo "=== clone.rs: tests relevant to strictness / arweave recovery ==="
rg -n "withheld-paths|encrypted-blobs|recover_from_arweave|unauthorized|happy|schema" crates/gl/src/clone.rs

Repository: Gitlawb/node

Length of output: 11990


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '740,820p' crates/gl/src/clone.rs

Repository: Gitlawb/node

Length of output: 3031


Fix Arweave manifest discovery pagination; encrypted-blobs parsing should be schema-strict

  • Arweave Repo slug normalization matches: the client uses {last-':' owner segment}/{name}, and the node anchors EncryptedManifest.repo with the same owner_did.split(':').next_back() short-form, tagging Repo:{manifest.repo}.
  • Arweave GraphQL discovery hard-codes first:100; if a repo has >100 encrypted-manifest anchors under the same Repo tag, recovery can miss the newest per-blob manifests and leave still-live blobs unrecovered—needs pagination/iteration until coverage is complete.
  • /encrypted-blobs parsing is fail-open on schema drift: recover_encrypted_blobs uses Value and defaults blobs to [] via unwrap_or_default(), unlike /withheld-paths which deserializes a typed struct and errors on missing/wrong fields; switch /encrypted-blobs to a strict typed response and fail on missing/wrong types.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gl/src/clone.rs` around lines 415 - 416, The slug normalization and
Arweave discovery/parsing need tightening: ensure owner short-form uses the same
split logic as the node by keeping owner_short =
owner.split(':').next_back().unwrap_or(owner) and forming slug =
format!("{owner_short}/{name}"); update the GraphQL discovery for
encrypted-manifest anchors in discover/recovery code to paginate instead of
hard-coding first:100 (iterate queries until no more edges or timestamps
covered) so recover_encrypted_blobs sees all anchors; change
recover_encrypted_blobs to deserialize the /encrypted-blobs response into a
strict typed struct (like the existing withheld-paths type) and fail on
missing/wrong fields instead of using Value with unwrap_or_default(), ensuring
schema-strict parsing and surfaced errors.

Comment thread crates/gl/src/clone.rs
Comment on lines +428 to +459
let query = r#"query($repo:String!){transactions(tags:[{name:"App-Name",values:["gitlawb"]},{name:"Schema",values:["gitlawb/encrypted-manifest/v1"]},{name:"Repo",values:[$repo]}],first:100){edges{node{id}}}}"#;
let gql_body = serde_json::json!({ "query": query, "variables": { "repo": slug } });
let resp = match client
.post(format!("{ag}/graphql"))
.json(&gql_body)
.send()
.await
{
Ok(r) if r.status().is_success() => r,
_ => return Ok(vec![]),
};
let gql: serde_json::Value = match resp.json().await {
Ok(v) => v,
Err(_) => return Ok(vec![]),
};
let tx_ids = parse_tx_ids(&gql);
if tx_ids.is_empty() {
return Ok(vec![]);
}

// 2. Fetch and parse each manifest body, then merge latest-wins per oid.
let mut manifests = Vec::new();
for tx in tx_ids {
let m = match client.get(format!("{ag}/{tx}")).send().await {
Ok(r) if r.status().is_success() => r,
_ => continue,
};
if let Ok(parsed) = m.json::<Manifest>().await {
manifests.push(parsed);
}
}
let oid_cid = merge_manifests(manifests);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Paginate manifest discovery instead of truncating at 100 transactions.

This fallback only inspects the first 100 manifest txs, then merges per-push manifests to find the latest CID per OID. Once a repo has more than 100 pushes, any private blob whose last seal is older than that window disappears from recovery even if it is still live at HEAD.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gl/src/clone.rs` around lines 428 - 459, The current GraphQL fetch
(the local `query` + `gql_body`, use of
`client.post(...).json(&gql_body).send().await`, and subsequent
`parse_tx_ids(&gql)`) only requests first:100 and stops, causing missing
manifests; change the logic to paginate: rewrite the GraphQL query to take
variables for "first" and "after" (cursor), then loop POSTing the query
(building `gql_body` with `{ "repo": slug, "first": batch_size, "after": cursor
}`), call `parse_tx_ids` (or adjust it to return edges + pageInfo) to collect tx
ids across pages, update `cursor` from pageInfo.nextCursor/hasNextPage until no
more pages, then proceed to fetch manifests and call
`merge_manifests(manifests)` as before; ensure the loop uses the same
`client.post(...).json(&gql_body).send().await` error handling as existing code
and retains the existing behavior when responses fail.

@beardthelion

Copy link
Copy Markdown
Contributor Author

CodeRabbit's pass here spans the whole unmerged stack (this PR is base=main, so it re-reviews all prereq commits). None of the eight findings fall on this PR's actual delta, which only touches blob_paths, ipfs_pin::list_all_objects, and pinata::list_all_objects. They're routed to where they belong rather than folded in here:

@kevincodex1

Copy link
Copy Markdown
Contributor

hello @beardthelion do you want to be a maintainer of this node repository? I would like to add you as one. let me know if you are interested.

@beardthelion

Copy link
Copy Markdown
Contributor Author

Thanks @kevincodex1, I'd be glad to. I've enjoyed working on this and I'm happy to take on more responsibility for it.

Couple of things so I know what I'm signing up for: what's the expectation for the role, just merge access, or also triage/releases and general upkeep? And who else maintains it currently?

Either way, count me in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Phase 3 withholding: incomplete withheld set can leak blobs in cleartext (ref namespace scope + fail-open ref walk)

2 participants